arXiv:1506.06405v2 [stat.ME] 25 Sep 2015


Combining and Extremizing Real-Valued Forecasts

Abstract

The weighted average is by far the most popular approach to combining multiple forecasts
of some future outcome. This paper shows that, for both probability and real-valued
forecasts, a non-trivial weighted average of different forecasts is always sub-optimal. More
specifically, it is not consistent with any set of information about the future outcome even
if the individual forecasts are. Furthermore, weighted averaging does not behave as if it
collects information from the forecasters and hence needs to be extremized, that is,
systematically transformed away from the marginal mean. This paper proposes a linear
extremization technique for improving the weighted average of real-valued forecasts. The
resulting more extreme version of the weighted average exhibits many properties of optimal
aggregation. Both this and the sub-optimality of the weighted average are illustrated
with simple examples involving synthetic and real-world data.

1. INTRODUCTION


Policy-makers often consult human and/or machine agents for forecasts of some future
outcome. For instance, multiple economics experts may provide quarterly predictions of gross
domestic product (GDP). Typically it is not possible to determine ex-ante which expert will be
the most accurate, and even if this could be done, heeding only the most accurate expert’s
advice would ignore a potentially large amount of relevant information that is being contributed
by the rest of the experts. Therefore a better alternative is to combine the forecasts into a single
consensus forecast that represents all the experts’ advice. The policy-makers, however, can
choose to aggregate the forecasts in many different ways. The final choice of the combination
rule is crucial because it often decides how much of the experts’ total information is
incorporated and hence how well the consensus forecast performs in terms of predictive accuracy.

Possibly because of its simplicity and intuitive appeal, the most popular approach to
combining forecasts is the weighted average, sometimes also known as the linear opinion pool.
This technique has a long tradition, with many empirical studies attesting to its benefits (see,
e.g., Bates and Granger 1969; Clemen 1989; Armstrong 2001). Even though the average
forecast does not always outperform the best single forecaster (Hibon and Evgeniou 2005), it is
still considered state-of-the-art (Elliott and Timmermann 2013) in many fields, including
economics (Blix et al. 2001), weather forecasting (Raftery et al. 2005), political science (Graefe
et al. 2014), and many others. In this paper, however, we show that non-trivial weighted
averaging is suboptimal, and propose a simple transformation to improve it. A more detailed
description of the contributions is given below.

In practice forecasts are typically either real-valued or probabilities of binary events, such
as rain or no rain tomorrow. Ranjan and Gneiting (2010) focus on the latter and explain how
the quality of a probability forecast (individual or aggregate) is typically measured in terms
of reliability and resolution (sometimes also known as calibration and sharpness, respectively).
Reliability describes how closely the conditional event frequencies align with the forecast
probabilities. Resolution, on the other hand, measures how far the forecasts are from the naive
baseline forecast, that is, the marginal event frequency. A forecast that is reliable and highly
resolute is very useful to the policy-maker because it is both accurate and close to the most
confident values of zero and one. Therefore a well-established goal in probability forecasting
is to maximize resolution subject to reliability (Murphy and Winkler 1987; Gneiting et al.
2007).


Strikingly, Ranjan and Gneiting (2010) prove that any non-trivial weighted average of two
or more different, reliable probability forecasts is unreliable and lacks resolution. In particular,
they explain that such a weighted average is under-confident in the sense that it is overly close
to the marginal event frequency. This result is an important contribution to the probability
forecasting literature in part because it points out a dramatic shortcoming of a methodology that
is used widely in practice. However, the authors neither provide a principled way of addressing
the shortcoming nor interpret potential causes of the under-confidence.

The first step towards addressing these issues and improving the general practice of
aggregation is to understand what is meant by principled aggregation. This topic was discussed by
Satopää et al. (2015a,b) who propose the partial information framework as a general platform
for modeling and combining forecasts. Under this framework, the outcome and the forecasts
share a probability space but without any restrictions on their dependence structure. Any
forecast heterogeneity is assumed to stem purely from information available to the forecasters and
how they decide to use it. For instance, forecasters studying the same (or different) articles
about the state of the economy may use distinct parts of the information and hence report
different predictions of the next quarter’s GDP. Even though, to date, this framework has been
mainly used for constructing new aggregators, it also offers an ideal environment for
analyzing other, already existing, aggregation techniques. No previous work, however, has used it to
study weighted averaging of probability or real-valued forecasts.

The first contribution of this paper leaves the type of forecasts unspecified and analyzes the
weighted average of any univariate forecasts under the partial information framework. The
results are general and encompass both probability and real-valued forecasts. First, the
aforementioned result in Ranjan and Gneiting (2010) is generalized to any type of univariate forecasts.
This result shows, for instance, that any non-trivial weighted average of reliable predictions
about the next quarter’s GDP is both unreliable and under-confident. Second, some general
properties of optimal aggregation are enumerated. This leads to an original point of view on
forecast aggregation, general, yet intuitive, descriptions of well-known properties such as
reliability and resolution, and an introduction of a new property, called variance expansion, that
is associated with aggregators whose variance is never less than the maximum variance among
the individual forecasts. Such aggregators are called expanding and can be considered to
collect information from the individual forecasters. Showing that a non-trivial weighted average is
never expanding leads to a mathematically precise yet easy-to-understand explanation of why
weighted averages tend to be under-confident. This reasoning suggests that under-confidence
is not unique to the class of weighted averages but extends to many other measures of central
tendency, such as the median, that also tend to reduce variance.

In probability forecasting the under-confidence of a simple aggregator, such as the average
or median, is typically alleviated by a heuristic known as extremizing, that is, by systematically
transforming the aggregate towards its nearer extreme (at zero or one). For instance, Ranjan
and Gneiting (2010) propose a beta transformation that extremizes the weighted average of the
probability forecasts; Satopää et al. (2014) use a logistic regression model to extremize the
average log-odds of the forecasts; many others, including Shlomi and Wallsten (2010), Baron
et al. (2014), and Mellers et al. (2014), have also discussed extremization of probability
forecasts. Intuitively, extremization increases confidence by explicitly moving the aggregate closer
to the most confident values of zero and one. Naturally, the same intuition applies to probability
forecasts of any categorical outcome. However, if the outcome and forecasts are real-valued, it
is no longer clear what values represent the most confident forecasts. Consequently, it seems
that extremization, as described above, lacks direction and cannot be applied. Furthermore, the
idea of extremizing may seem counter-intuitive given the large amount of literature attesting
to the benefits of shrinkage (James and Stein 1961). These may be the main reasons why, to
the best of our knowledge, no previous literature has discussed extremization of real-valued
forecasts.

Therefore it is perhaps somewhat surprising that our second contribution shows that
extremizing can improve aggregation also when the individual forecasts are real-valued. First,
the notion of extremizing is made precise. This involves introducing a general definition that
differs slightly from the above heuristic. In particular, extremizing is redefined as a shift away
from the least confident forecast, namely the marginal mean of the outcome, instead of towards
the most confident (potentially undefined) values. Second, our definition and theoretical
analysis motivate a convex optimization procedure that linearly extremizes the optimally weighted
average of real-valued forecasts. The technique is illustrated on simple examples involving
both synthetic and real-world data. In each example extremizing leads to improved
aggregation with many of the optimal properties enumerated in the beginning of the analysis.

The rest of the paper is structured as follows. Section 2 briefly introduces the general partial
information framework and discusses some properties of the optimal aggregation within that
framework. The class of weighted averages is then analyzed in the light of these properties.
Section 3 describes the optimization technique for extremizing the weighted average of
real-valued forecasts. Section 4 illustrates this technique and our theoretical results over synthetic
data. Section 5 repeats the analysis over real-world data. The final section concludes and
discusses future research directions.

2. FORECAST AND AGGREGATION PROPERTIES


2.1 Optimal Aggregation 

Consider $N$ forecasters and suppose forecaster $j$ predicts $X_j$ for some (integrable) quantity
of interest $Y$. The partial information framework assumes that $Y$ and $X_j$, for $j = 1, \dots, N$,
are measurable random variables under some common probability space $(\Omega, \mathcal{F}, \mathbb{P})$. Akin to
Murphy and Winkler (1987), Ranjan and Gneiting (2010), Jolliffe and Stephenson (2012), and
many others, the forecasters are assumed to be reliable, that is, conditionally unbiased such that
$\mathbb{E}(Y|X_j) = X_j$ for all $j = 1, \dots, N$. To interpret this assumption, observe that the principal
$\sigma$-field $\mathcal{F}$ holds all possible information that can be known about $Y$. Each reliable forecast $X_j$
then generates a sub-$\sigma$-field $\sigma(X_j) := \mathcal{F}_j \subseteq \mathcal{F}$ such that $X_j = \mathbb{E}(Y|\mathcal{F}_j)$. Conversely, suppose
that $X_j = \mathbb{E}(Y|\mathcal{F}_j)$ for some $\mathcal{F}_j \subseteq \mathcal{F}$, then

$$
\mathbb{E}(Y|X_j) = \mathbb{E}[\mathbb{E}(Y|\mathcal{F}_j, X_j)|X_j] = \mathbb{E}[\mathbb{E}(Y|\mathcal{F}_j)|X_j] = \mathbb{E}(X_j|X_j) = X_j.
$$

Therefore a forecast is reliable if and only if it represents the optimal use of some information
set, that is, it is consistent with some partial information $\mathcal{F}_j \subseteq \mathcal{F}$. Given that at this level of
specificity the framework is highly general and hence likely to be a good approximation of
real-world prediction polling, it offers an ideal platform for analyzing different aggregators.

In this paper an aggregator is defined to be any forecast that is measurable with respect to
$\mathcal{F}'' := \sigma(X_1, \dots, X_N)$, namely the $\sigma$-field generated by the individual forecasts. For the sake
of notational clarity, aggregators are denoted with different versions of the script symbol $\mathcal{X}$. If
$\mathbb{E}(Y^2) < \infty$, the conditional expectation $\mathcal{X}'' := \mathbb{E}(Y|\mathcal{F}'')$ minimizes the expected quadratic
loss among all aggregators (see, e.g., Durrett 2010). This forecast is called the revealed
aggregator because it optimally utilizes all the information that the forecasters reveal through their
forecasts. Even though $\mathcal{X}''$ is typically too abstract to be applied in practice, it provides an
optimal baseline for aggregation efficiency. Therefore studying its properties gives guidance
for improving aggregators currently used in practice. Some of these properties are summarized
in the following theorem. The proof is deferred to the Appendix.

Theorem 2.1. Suppose that $X_j = \mathbb{E}(Y|\mathcal{F}_j)$ for all $j = 1, \dots, N$, and denote the revealed
aggregator with $\mathcal{X}'' = \mathbb{E}(Y|\mathcal{F}'')$, where $\mathcal{F}'' = \sigma(X_1, \dots, X_N)$. Let $\delta_{\max} := \max_j \{\mathrm{Var}(X_j)\}$
be the maximal variance among the individual forecasts. Then the following holds.

i) Marginal Consistency. $\mathcal{X}''$ is marginally consistent: $\mathbb{E}(\mathcal{X}'') = \mathbb{E}(Y) := \mu_0$.

ii) Reliability. $\mathcal{X}''$ is reliable: $\mathbb{E}(Y|\mathcal{X}'') = \mathcal{X}''$.

iii) Variance Expansion. $\mathcal{X}''$ is expanding: $\delta_{\max} \le \mathrm{Var}(\mathcal{X}'')$. In words, the variance of $\mathcal{X}''$
is always at least as large as that of the most variable forecast.
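
The properties in Theorem 2.1 can be checked numerically on a toy example. The sketch below is not taken from the paper: it assumes an outcome built from three independent zero-mean Gaussian pieces, with each of two hypothetical forecasters observing one piece, so that the sum of the two forecasts plays the role of the revealed aggregator.

```python
import random
import statistics

rng = random.Random(0)
K = 20000

# Toy setup (an assumption for illustration): Y = Z1 + Z2 + Z3 with
# independent zero-mean Gaussian pieces; forecaster j only observes Zj.
x1, x2, y = [], [], []
for _ in range(K):
    z1, z2, z3 = rng.gauss(0, 0.5), rng.gauss(0, 0.7), rng.gauss(0, 0.5)
    x1.append(z1)            # X1 = E(Y | Z1) = Z1, a reliable forecast
    x2.append(z2)            # X2 = E(Y | Z2) = Z2, a reliable forecast
    y.append(z1 + z2 + z3)

# With independent pieces, E(Y | X1, X2) = X1 + X2 is the revealed aggregator.
revealed = [a + b for a, b in zip(x1, x2)]

delta_max = max(statistics.pvariance(x1), statistics.pvariance(x2))
var_revealed = statistics.pvariance(revealed)
mean_gap = abs(statistics.fmean(revealed) - statistics.fmean(y))

print(f"max individual variance:         {delta_max:.3f}")
print(f"variance of revealed aggregator: {var_revealed:.3f}")
print(f"|mean(revealed) - mean(Y)|:      {mean_gap:.3f}")
```

In this setup the revealed aggregator's sample variance should sit near 0.25 + 0.49 = 0.74, above both individual variances, while its sample mean stays close to that of the outcome.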

Marginal consistency states that the forecast and the outcome agree in expectation. If $X_j$
is reliable, then $\mathbb{E}(X_j) = \mathbb{E}[\mathbb{E}(Y|X_j)] = \mathbb{E}(Y) = \mu_0$. Consequently, all reliable forecasts
(individual or aggregate) are marginally consistent. The converse, however, is not true. For
instance, Theorem 2.2 (see Section 2.2) shows that any non-trivial weighted average is marginally
consistent but unreliable. This is an important observation because it provides a technique for
proving lack of reliability via marginal inconsistency - a task that is generally much easier than
disproving reliability directly.

Given that each reliable forecast can be associated with a sub-$\sigma$-field and that
conditional expectation is a contraction in $L^2$ (Durrett 2010, Theorem 5.1.4), the variance of
any reliable forecast (individual or aggregate) is always upper-bounded by $\mathrm{Var}(Y)$.
Theorem 2.1 further shows that the corresponding lower bound for $\mathrm{Var}(\mathcal{X}'')$ is the maximum
variance among the forecasters. To interpret this lower bound, consider an increasing
sequence of $\sigma$-fields $\mathcal{F}_0 = \{\emptyset, \Omega\} \subseteq \mathcal{F}_1 \subseteq \cdots \subseteq \mathcal{F}_R \subseteq \mathcal{F}$ and the corresponding
forecasts $X_r = \mathbb{E}(Y|\mathcal{F}_r)$ for $r = 0, 1, \dots, R$. According to Satopää et al. (2015a,
Proposition 2.1), the variances of these forecasts respect the same order as their information sets:
$\mathrm{Var}(X_0) \le \mathrm{Var}(X_1) \le \cdots \le \mathrm{Var}(X_R) \le \mathrm{Var}(Y)$. This suggests that the amount of
information used in a reliable forecast is reflected in its variance. Naturally, if an aggregator
collects information from a group of forecasters, it should use at least as much information as
the most informed individual forecaster; that is, its variance should be at least as large as that
of any individual forecaster. Therefore any aggregator that expands variance and satisfies this
condition is considered a collector of information.

Recall that in probability forecasting a well-established goal is to maximize resolution
subject to reliability. This goal can be easily interpreted intuitively with the help of partial
information. First, conditioning on reliability requires the forecast to be consistent with some set
of information about $Y$. Maximizing the resolution of this forecast takes it as far from $\mu_0$ as
possible. This is equivalent to increasing the variance of the forecast as close to the theoretical
upper bound $\mathrm{Var}(Y)$ as possible. Therefore the goal is equivalent to maximizing the amount of
information that the forecast is consistent with. Intuitively, this is very reasonable and should
be considered as the general goal in forecasting.

2.2 Weighted Averaging 

The rest of the paper analyzes the most commonly used aggregator, namely the weighted
average. The following theorem shows that a non-trivial weighted average is neither expanding
nor reliable and therefore can be considered suboptimal. The proof is again deferred to the
Appendix. A similar result does not hold for all linear combinations of the individual forecasts.
For instance, Section 4 describes a model under which the optimal aggregator $\mathcal{X}''$ is always a
linear combination of the individual $X_j$’s.

Theorem 2.2. Suppose that $X_j = \mathbb{E}(Y|\mathcal{F}_j)$ for $j = 1, \dots, N$. Denote the weighted average
with $\mathcal{X}_w := \sum_{j=1}^N w_j X_j$, where $w_j \ge 0$, for all $j = 1, \dots, N$, and $\sum_{j=1}^N w_j = 1$. Let
$m = \arg\max_j \{\mathrm{Var}(X_j)\}$ identify the forecast with the maximal variance $\delta_{\max} = \mathrm{Var}(X_m)$.
Then the following holds.

i) $\mathcal{X}_w$ is marginally consistent.

ii) $\mathcal{X}_w$ is not reliable, that is, $\mathbb{P}[\mathbb{E}(Y|\mathcal{X}_w) \ne \mathcal{X}_w] > 0$ if there exists a forecast pair $i \ne j$ such
that $\mathbb{P}(X_i \ne X_j) > 0$ and $w_i, w_j > 0$. In words, $\mathcal{X}_w$ is necessarily unreliable if it assigns
positive weight to at least two different forecasts.

iii) Under the conditions of item ii), $\mathcal{X}_w$ lacks resolution. More specifically, if $\mathcal{X}'_w := \mathbb{E}(Y|\mathcal{X}_w)$
is the reliable version of $\mathcal{X}_w$, then $\mathbb{E}(\mathcal{X}_w) = \mathbb{E}(\mathcal{X}'_w) = \mu_0$ but $\mathrm{Var}(\mathcal{X}_w) < \mathrm{Var}(\mathcal{X}'_w)$. In
other words, $\mathcal{X}_w$ is under-confident in the sense that it is closer to the marginal mean $\mu_0$
than its reliable version $\mathcal{X}'_w$.

iv) $\mathcal{X}_w$ is not expanding. In particular, $\mathrm{Var}(\mathcal{X}_w) \le \delta_{\max}$, which shows that $\mathcal{X}_w$ is
under-confident in the sense that it is as close or closer to the marginal mean $\mu_0$ than the revealed
aggregator $\mathcal{X}''$. Furthermore, $\mathrm{Var}(\mathcal{X}_w) = \mathrm{Var}(\mathcal{X}'')$ if and only if $\mathcal{X}_w = \mathcal{X}'' = X_m$;
that is, $X_m$ provides all the information necessary for $\mathcal{X}''$, and $\mathcal{X}_w$ assigns all weight to
$X_m$ (or to a group of forecasts all equal to $X_m$).
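
The variance bound in item iv) can be made plausible with a short calculation. The following is a sketch based on the Cauchy-Schwarz inequality and is not necessarily the argument used in the Appendix; the symbol $\sigma_j := \sqrt{\mathrm{Var}(X_j)}$ is introduced here only for this purpose. For a weighted average $\sum_j w_j X_j$ with non-negative weights that sum to one,

```latex
\begin{align*}
\mathrm{Var}\Big(\sum_{j=1}^N w_j X_j\Big)
  &= \sum_{i=1}^N \sum_{j=1}^N w_i w_j \,\mathrm{Cov}(X_i, X_j) \\
  &\le \sum_{i=1}^N \sum_{j=1}^N w_i w_j \,\sigma_i \sigma_j
   = \Big(\sum_{j=1}^N w_j \sigma_j\Big)^2 \\
  &\le \max_j \sigma_j^2 = \delta_{\max}.
\end{align*}
```

The first inequality bounds each covariance by the product of standard deviations, and the second uses the fact that a convex combination of the $\sigma_j$ cannot exceed the largest of them.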


This theorem discusses under-confidence under two different baselines. Item iii) is a
generalization of Ranjan and Gneiting (2010, Theorem 2.1). Intuitively, it states that if $\mathcal{X}_w$ is
trained to use its information accurately, the resulting aggregator is more confident. Therefore
under-confidence is defined relative to the reliable version of $\mathcal{X}_w$. Under this kind of
comparison, however, a reliable aggregator is never under-confident. For instance, an aggregator that
ignores the individual forecasts and always returns the marginal mean $\mu_0$ is reliable and hence
would not be considered under-confident. Intuitively, however, it is clear that no aggregate
forecast is more under-confident than the marginal mean $\mu_0$. To address this drawback, item
iv) defines under-confidence relative to the revealed aggregator instead. Such a comparison
estimates whether the weighted average is as confident as it should be given the information
it received through the forecasts. Item iv) shows that this happens only if all the weight is
assigned to a forecaster whose information set contains every other forecaster’s information.
However, even if $\mathcal{X}_w$ could pick out the most informed forecaster ex-ante, the chances of a
single forecaster knowing everything that the rest of the forecasters know are extremely small
in practice. In essentially all other cases, $\mathcal{X}_w$ is under-confident, unreliable, and hence not
consistent with any set of information about $Y$.

Unfortunately, this shortcoming spans across all measures of central tendency. These
aggregators reduce variance and hence are separated from the revealed aggregator by the maximum
variance among the individual forecasts. For instance, Papadatos (1995) discusses the
maximum variance of different order statistics and shows that the variance of the median is upper
bounded by the global variance of the individual forecasts. Given that such aggregators are not
expanding, they cannot be considered to collect information. To illustrate, consider a group
of forecasters, each independently making a probability forecast of 0.9 for the occurrence of
some future event. If these forecasters are using different evidence, then clearly the combined
evidence should give an aggregate forecast somewhat greater than 0.9. In this simple scenario,
however, measures of central tendency will always aggregate to 0.9. Therefore they fail to
account for the information heterogeneity among the forecasters. Instead, they reduce
“measurement error,” which is philosophically very different to the idea of information aggregation
discussed in this paper.
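
The 0.9 example can be made concrete under additional assumptions that the text does not make: a common prior probability (set to 0.5 below) and evidence that is conditionally independent across forecasters given the outcome. Under those assumptions each forecaster contributes a likelihood ratio, and the ratios multiply. The helper `combine_independent` is a hypothetical name used only for this sketch.

```python
def combine_independent(probs, prior=0.5):
    """Posterior probability when every forecast is a posterior formed from the
    same prior using conditionally independent evidence (a naive-Bayes assumption)."""
    prior_odds = prior / (1.0 - prior)
    odds = prior_odds
    for p in probs:
        # Each forecaster's likelihood ratio is posterior odds over prior odds.
        odds *= (p / (1.0 - p)) / prior_odds
    return odds / (1.0 + odds)


forecasts = [0.9, 0.9, 0.9]
average = sum(forecasts) / len(forecasts)
combined = combine_independent(forecasts)

print(f"average forecast:  {average:.4f}")
print(f"combined evidence: {combined:.4f}")
```

With these assumptions the combined probability exceeds 0.9, whereas the average stays at 0.9. Different priors or dependent evidence would change the combined value.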


Theorem 2.2, however, is not only negative in nature; it is also constructive in several 
different ways. First, it motivates a general and precise definition of extremizing: 


Definition 2.3. Extremization. Consider two reliable forecasts $X_i$ and $X_j$. Denote their
common marginal mean with $\mathbb{E}(X_i) = \mathbb{E}(X_j) = \mu_0$. The forecast $X_j$ extremizes $X_i$ if and only if
either $X_j \le X_i \le \mu_0$ or $\mu_0 \le X_i \le X_j$ always holds.


It is interesting to contrast this definition with the popular extremization heuristic in the
context of probability forecasting. Definition 2.3 suggests that simply moving, say, the average
probability forecast closer to zero or one improves the aggregate if and only if the marginal
probability of success is 0.5. In other cases naively following the heuristic may end up
degrading the aggregate. For instance, consider a geographical region where rain is known to
occur on 20% of the days. If the average probability forecast of rain tomorrow is 0.30,
instead of following the heuristic and shifting this aggregate towards zero and hence closer to
the marginal mean of 0.20, the aggregate should actually be shifted in the opposite direction,
namely closer to one. Second, Theorem 2.2 suggests that extremization, as defined formally
above, is likely to improve the weighted average of any type of univariate forecasts. This
justifies the construction of a broader class of extremizing techniques. In particular, the second part
of item iv) states that extremizing is likely to improve the weighted average when the single
most informed forecaster knows a lot less than all the forecasters know as a group. To illustrate
this, the next section introduces a simple optimization procedure that extremizes the weighted
average of real-valued forecasts.
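
The rain example can be checked directly against Definition 2.3. The sketch below tests the ordering condition for a single realization only, whereas the definition requires it to hold always. The function name `extremizes` and the two candidate values 0.25 and 0.40 are illustrative choices, not values from the paper.

```python
def extremizes(x_j, x_i, mu0):
    """True if x_j extremizes x_i relative to mu0 for this single realization:
    either x_j <= x_i <= mu0 or mu0 <= x_i <= x_j."""
    return (x_j <= x_i <= mu0) or (mu0 <= x_i <= x_j)


mu0 = 0.20      # marginal frequency of rain in the region
average = 0.30  # average probability forecast of rain tomorrow

toward_zero = 0.25  # follows the "nearer extreme" heuristic
toward_one = 0.40   # moves away from the marginal mean

print(extremizes(toward_zero, average, mu0))  # False
print(extremizes(toward_one, average, mu0))   # True
```

Shifting towards zero moves the aggregate closer to the marginal mean and therefore does not count as extremizing in the sense of the definition.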


3. EXTREMIZING REAL-VALUED FORECASTS

Estimating the weights and the amount of extremization requires the forecasters to address
more than one related problem. For instance, they may participate in separate yet similar
prediction problems or give repeated forecasts on a single recurring event. Across such
problems the weights and the resulting under-confidence are likely to remain stable, allowing the
aggregator parameters to be estimated based on multiple predictions per forecaster. Therefore,
from now on, suppose that the forecasters address $K \ge 2$ problems. Denote the outcome of
the $k$th problem with $Y_k \in \mathbb{R}$ and let $X_{jk} \in \mathbb{R}$ represent the $j$th forecaster’s prediction for this
outcome.

Extremization requires at least two parameters: the marginal mean, which acts as the pivot
point and decides the direction of extremizing, and the amount of extremization itself.
Extremization, of course, could be performed in many different ways. However, if $\mathcal{X}^*_k$ denotes the
extremized version of the weighted average for the $k$th problem, then probably the simplest
and most natural starting point is the following:

$$
\mathcal{X}^*_k = \alpha (w'X_k - \mu_0) + \mu_0,
$$

where $X_k = (X_{1k}, \dots, X_{Nk})'$ collects the forecasts for the $k$th outcome, $w = (w_1, \dots, w_N)'$ is
the weight vector, and $\alpha \in (1, \infty)$ (or $\alpha \in [0, 1)$) leads to extremization (or contraction towards
$\mu_0$, respectively). If $\alpha = 1$, then $\mathcal{X}^*$ is equal to the weighted average $\mathcal{X}_w$. This linear form
is particularly convenient because it leads to efficient parameter estimation and also maintains
marginal consistency of $\mathcal{X}_w$; that is, $\mathbb{E}(\mathcal{X}^*) = \mu_0$ for all values of $\alpha$. However, $\mathrm{Var}(\mathcal{X}^*)$
increases in $\alpha$ such that $\mathrm{Var}(\mathcal{X}^*) = \alpha^2 \mathrm{Var}(\mathcal{X}_w) > \mathrm{Var}(\mathcal{X}_w)$ for all $\alpha > 1$. Therefore, for a
large enough $\alpha$, $\mathcal{X}^*$ is both marginally consistent and expanding. These properties hold even
if the weighted average is replaced by some other marginally consistent aggregator. However,
given that the main purpose of this procedure is to illustrate Theorem 2.2, this paper only
considers the weighted average.
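
The two properties of the linear form, an unchanged pivot and a variance scaled by the square of the extremization factor, can be verified numerically. The following is a minimal sketch; the weights, the marginal mean of 2.0, the factor of 1.5, and the simulated forecasts are all hypothetical.

```python
import random
import statistics


def extremize(forecasts, weights, alpha, mu0):
    """Linear extremization alpha * (w'X_k - mu0) + mu0 for a single problem."""
    weighted_avg = sum(w * x for w, x in zip(weights, forecasts))
    return alpha * (weighted_avg - mu0) + mu0


rng = random.Random(1)
weights = [0.5, 0.3, 0.2]
mu0, alpha = 2.0, 1.5

# Hypothetical forecasts scattered around the marginal mean mu0.
problems = [[mu0 + rng.gauss(0, 1) for _ in weights] for _ in range(5000)]

averaged = [extremize(x, weights, 1.0, mu0) for x in problems]   # alpha = 1
extremized = [extremize(x, weights, alpha, mu0) for x in problems]

# The variance scales by alpha squared; deviations from mu0 scale by alpha.
var_ratio = statistics.pvariance(extremized) / statistics.pvariance(averaged)
mean_shift = (statistics.fmean(extremized) - mu0) - alpha * (statistics.fmean(averaged) - mu0)

print(f"variance ratio: {var_ratio:.4f} (alpha squared is {alpha ** 2:.4f})")
```

Because the transformation is affine in the weighted average, the variance ratio matches the squared factor up to floating-point error regardless of how the forecasts are distributed.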

Recall that the forecasts are assumed calibrated and hence marginally consistent with the
outcomes. Therefore an unbiased estimator of the prior mean $\mu_0$ is given by the average of the
forecasts $\frac{1}{KN} \sum_{k=1}^K \sum_{j=1}^N X_{jk}$ or, alternatively, by the average of the outcomes $\frac{1}{K} \sum_{k=1}^K Y_k$.
Estimating $\mu_0$ in this manner, however, leads to a two-step estimation procedure. A more direct
approach is to estimate all the parameters, namely $\alpha$, $\mu_0$, and $w$, jointly over some criterion.
If $Y_k$ has an explicit likelihood in terms of $\mathcal{X}^*_k$, then the parameters can be estimated by
maximizing this likelihood. Assuming an explicit parametric form, however, can be avoided by
recalling from Section 2.1 that the revealed aggregator $\mathcal{X}''$ utilizes the forecasters’ information
optimally and minimizes the expected quadratic loss among all functions measurable with
respect to $\mathcal{F}''$. Ideally, $\mathcal{X}^*$ would behave similarly to $\mathcal{X}''$. Therefore it makes sense to estimate its
parameters by minimizing the average quadratic loss over some training set. Section 4 shows
that this is likely to improve both the resolution and reliability of the weighted average.
These considerations lead to the following estimation problem:

$$
\begin{aligned}
\text{minimize} \quad & \sum_{k=1}^K \left[ \alpha (w'X_k - \mu_0) + \mu_0 - Y_k \right]^2 \\
\text{subject to} \quad & w_j \ge 0 \text{ for } j = 1, \dots, N, \\
& \sum_{j=1}^N w_j = 1, \text{ and} \\
& \alpha \ge 0.
\end{aligned}
\tag{1}
$$


To express this problem in a form that is more amenable to estimation, denote an $N \times N$
identity matrix with $I_N$, a vector of $K$ ones with $1_K$, and a vector of $N$ zeros with $0_N$. If $Y =
(Y_1, \dots, Y_K)'$, $X = (1_K, (X_1, \dots, X_K)')$, and $A = (0_N, I_N)$, then problem (1) is equivalent
to

$$
\begin{aligned}
\text{minimize} \quad & \tfrac{1}{2} \beta'X'X\beta - Y'X\beta \\
\text{subject to} \quad & -A\beta \le 0_N,
\end{aligned}
\tag{2}
$$

where the inequality is interpreted element-wise and $\beta$ is a vector of $N + 1$ optimization
parameters. The equivalence follows from writing $\alpha (w'X_k - \mu_0) + \mu_0 = (1 - \alpha)\mu_0 + \alpha w'X_k$, so that
$\beta_0 = (1 - \alpha)\mu_0$ and $\beta_j = \alpha w_j$ for $j = 1, \dots, N$. Given that $X'X$ is always positive semidefinite,
problem (2) is a convex quadratic program that can be solved efficiently with standard
optimization techniques. If $\beta^* = (\beta^*_0, \dots, \beta^*_N)'$ represents the solution to (2), the optimal values
of the original parameters can be recovered by

$$
\begin{aligned}
\alpha^* &= \sum_{j=1}^N \beta^*_j, \\
w^*_j &= \beta^*_j / \alpha^* \text{ for } j = 1, \dots, N, \text{ and} \\
\mu^*_0 &= \beta^*_0 / (1 - \alpha^*).
\end{aligned}
$$

The next two sections apply and evaluate this method both on simulated and real-world data.
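
The estimation step can be sketched in a few lines of standard-library Python. The paper only states that standard optimization techniques apply; the sketch below swaps in cyclic coordinate descent with clipping at zero, chosen here for brevity rather than taken from the paper. The function name `fit_extremized_average` and the synthetic data are hypothetical, and the recovery step assumes that the fitted extremization factor is neither 0 nor 1.

```python
import random


def fit_extremized_average(forecasts, outcomes, sweeps=500):
    """Minimize 0.5 * b'X'Xb - Y'Xb with b_1..b_N >= 0 by cyclic coordinate
    descent, then map b back to (alpha, w, mu0).
    forecasts: K rows of N values; outcomes: K values."""
    rows = [[1.0] + list(f) for f in forecasts]   # design matrix (1_K, forecasts)
    p = len(rows[0])
    gram = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * y for r, y in zip(rows, outcomes)) for a in range(p)]

    beta = [0.0] * p
    for _ in range(sweeps):
        for a in range(p):
            rest = sum(gram[a][b] * beta[b] for b in range(p) if b != a)
            value = (xty[a] - rest) / gram[a][a]
            # The intercept b_0 is unconstrained; the other entries are clipped at zero.
            beta[a] = value if a == 0 else max(0.0, value)

    alpha = sum(beta[1:])                   # assumed to be positive
    weights = [b / alpha for b in beta[1:]]
    mu0 = beta[0] / (1.0 - alpha)           # not identified when alpha equals 1
    return alpha, weights, mu0


# Synthetic check: three forecasters with independent information and mean 1.
rng = random.Random(2)
forecasts, outcomes = [], []
for _ in range(2000):
    z = [rng.gauss(0, 1) for _ in range(3)]
    forecasts.append([1.0 + zj for zj in z])            # X_jk = mu0 + Z_j
    outcomes.append(1.0 + sum(z) + rng.gauss(0, 0.5))   # Y_k

alpha, weights, mu0 = fit_extremized_average(forecasts, outcomes)
print(alpha, weights, mu0)
```

In this synthetic setup the outcome equals the sum of the forecasts minus 2 plus noise, so the fit should land near an extremization factor of 3, equal weights of one third, and a marginal mean of 1.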


4. SIMULATION STUDY 


This section illustrates Theorem 2.2 on data generated from the Gaussian partial information
model introduced in Satopää et al. (2015a,b) as a close yet practical specification of the general
partial information framework. The simplest version of this model occurs when the outcome
$Y$ and the forecasts $X_j$ are real-valued with mean zero. The observables for the $k$th problem
are then generated jointly from the following multivariate Gaussian distribution:




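$$
\begin{pmatrix} Y_k \\ X_{1k} \\ \vdots \\ X_{Nk} \end{pmatrix}
\sim \mathcal{N}_{N+1}\left( \mathbf{0},\;
\begin{pmatrix} 1 & \mathrm{diag}(\Sigma)' \\ \mathrm{diag}(\Sigma) & \Sigma \end{pmatrix}
:= \begin{pmatrix}
1 & \delta_1 & \delta_2 & \cdots & \delta_N \\
\delta_1 & \delta_1 & \rho_{1,2} & \cdots & \rho_{1,N} \\
\delta_2 & \rho_{2,1} & \delta_2 & \cdots & \rho_{2,N} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\delta_N & \rho_{N,1} & \rho_{N,2} & \cdots & \delta_N
\end{pmatrix} \right),
\tag{3}
$$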


where the covariance matrix describes the information structure among the forecasters. In
particular, the maximum amount of information is 1.0. The diagonal entry $\delta_j \in [0, 1]$ represents
the amount of information used by forecaster $j$ such that if $\delta_j = 1$ (or $\delta_j = 0$), the forecaster
always reports the correct answer (or the marginal mean $\mu_0 = 0$, respectively). The
off-diagonal $\rho_{i,j}$, on the other hand, can be regarded as the amount of information overlap between
forecasters $i$ and $j$. Using the well-known properties of a conditional multivariate Gaussian
distribution, Satopää et al. (2015a,b) show that under this model the forecasts are reliable and
that the revealed aggregator for the $k$th problem is $\mathcal{X}''_k = \mathbb{E}(Y_k|X_k) = \mathrm{diag}(\Sigma)'\Sigma^{-1}X_k$.

The distribution (3) is particularly useful because it provides a realistic model for testing
aggregation under different information structures. This section considers $N = 5$ forecasters
under two different structures:

Figure 1: Information Distribution Among N = 5 Forecasters. Panels: (a) No Information
Overlap; (b) High Information Overlap. The top bar next to Full
Information represents all possible information that can be known about $Y$. The bar leveled
horizontally with Forecaster $j$ represents the information used by that forecaster.


No Information Overlap. Fix $\delta_j = 0.1 + 0.02j$ for $j = 1, \dots, 5$ and let $\rho_{i,j} = 0$ for all
$i \ne j$. Therefore the forecasters have independent information sources. This information
structure is illustrated in Figure 1a. Summing up the individual variances shows that
as a group the forecasters know 80% of the total information. The revealed aggregator
reduces to $\mathcal{X}''_k = \sum_{j=1}^5 X_{jk}$, has variance 0.80, and therefore efficiently uses all the
forecasters’ information.

High Information Overlap. Fix $\delta_j = 0.1 + 0.02j$ for $j = 1, \dots, 5$ and let $\rho_{i,j} = 0.12$
for all $i \ne j$. Therefore the forecasters have significant information overlap and as a group
know only 32% of the total information. This information structure is illustrated in
Figure 1b. The revealed aggregator reduces to $\mathcal{X}''_k = \sum_{j=2}^5 X_{jk} - 3X_{1k}$, has variance
0.32, and therefore efficiently uses all the forecasters’ information.
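
Both scenarios can be simulated without matrix algebra by building the observables from independent Gaussian pieces. This construction is an assumption of the sketch, chosen because it reproduces the mean and covariance structure in (3) for these two parameter settings. The function names are hypothetical. The check below compares the sample variance of the revealed aggregator with the stated 0.80 and 0.32, and also reports the variance of the equally weighted average for contrast.

```python
import random
import statistics

deltas = [0.1 + 0.02 * j for j in range(1, 6)]   # 0.12, 0.14, ..., 0.20


def draw_no_overlap(rng):
    """One draw (Y, [X_1..X_5]) where the forecasters hold independent pieces."""
    x = [rng.gauss(0, d ** 0.5) for d in deltas]
    unknown = rng.gauss(0, (1 - sum(deltas)) ** 0.5)   # information nobody has
    return sum(x) + unknown, x


def draw_high_overlap(rng, rho=0.12):
    """One draw where every pair of forecasters has covariance rho."""
    shared = rng.gauss(0, rho ** 0.5)
    private = [rng.gauss(0, max(d - rho, 0.0) ** 0.5) for d in deltas]
    x = [shared + e for e in private]
    known = rho + sum(d - rho for d in deltas)         # 0.32 in total
    unknown = rng.gauss(0, (1 - known) ** 0.5)
    return shared + sum(private) + unknown, x


rng = random.Random(3)
K = 20000
no = [draw_no_overlap(rng) for _ in range(K)]
hi = [draw_high_overlap(rng) for _ in range(K)]

# Revealed aggregators as given in the two scenario descriptions.
var_no = statistics.pvariance([sum(x) for _, x in no])
var_hi = statistics.pvariance([sum(x[1:]) - 3 * x[0] for _, x in hi])
# Equally weighted average under no overlap, for comparison.
var_mean_no = statistics.pvariance([sum(x) / 5 for _, x in no])

print(f"revealed, no overlap:   {var_no:.3f}")
print(f"revealed, high overlap: {var_hi:.3f}")
print(f"simple average, no overlap: {var_mean_no:.3f}")
```

Under no overlap the simple average has variance near 0.80 / 25 = 0.032, well below the largest individual variance of 0.20, which is the lack of expansion described in Theorem 2.2.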

The competing aggregators are the equally weighted average $\bar{X}$, the optimally weighted
average $\mathcal{X}_w$, the extremized version of the optimally weighted average $\mathcal{X}^*$, and the revealed
aggregator $\mathcal{X}''$. The parameters in $\mathcal{X}^*$ and $\mathcal{X}_w$ are first estimated by minimizing the average
quadratic loss over a training set of 10,000 draws from (3). After this, all the competing
aggregators are evaluated on an independent test set of another 10,000 draws from (3). Therefore all
the following results, apart from the parameter estimates, represent out-of-sample performance.

Table 1: Synthetic Data. Estimated parameter values.

Scenario      Forecast          $\mu_0$    $\alpha$   $w_1$    $w_2$    $w_3$    $w_4$    $w_5$
No Overlap    $\mathcal{X}_w$   -          -          0.0000   0.1080   0.2293   0.3025   0.3601
              $\mathcal{X}^*$   0.0004     5.0137     0.1964   0.2023   0.2008   0.2006   0.2000
High Overlap  $\mathcal{X}_w$   -          -          0.0000   0.0000   0.0440   0.4262   0.5298
              $\mathcal{X}^*$   -0.0077    1.3048     0.0000   0.0000   0.1456   0.3959   0.4585

In probability forecasting the quality of the predictions is typically assessed using a
reliability diagram. The idea is to first sort the outcome-forecast pairs into some number of bins
based on the forecasts and then plot the average forecast against the average outcome within
each bin. Figures 2 and 3 generalize this to continuous outcomes by replacing the conditional
empirical event frequency with the conditional average outcome. The bins are chosen so that
they all contain the same number of forecast-outcome pairs. The vertical dashed line
represents the marginal mean $\mu_0 = 0$. The plots have been scaled such that the identity function
shows as the diagonal. Any deviation from this diagonal suggests lack of reliability. The grey
area represents the reliability diagrams of 1,000 bootstrap samples of the forecast-outcome
pairs. Therefore it serves as a visual guide for assessing uncertainty. The inset histograms help
to assess resolution by comparing the empirical distribution of the forecasts against the prior
distribution of $Y$, namely the standard Gaussian distribution represented by the red curve. In
particular, if the forecast is reliable, then the closer its empirical distribution is to the standard
Gaussian, the more information is being used in the forecast.
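
The binning step behind such a diagram can be sketched as follows. This is a minimal illustration rather than the code used for the figures: it omits the bootstrap band and the plotting, and the function name `reliability_points` and the simulated forecasts are hypothetical.

```python
import random
import statistics


def reliability_points(forecasts, outcomes, n_bins=10):
    """Sort pairs by forecast, split them into equal-count bins, and return
    one (mean forecast, mean outcome) point per bin."""
    pairs = sorted(zip(forecasts, outcomes))
    size = len(pairs) // n_bins
    points = []
    for b in range(n_bins):
        # The last bin absorbs any remainder.
        chunk = pairs[b * size:] if b == n_bins - 1 else pairs[b * size:(b + 1) * size]
        points.append((statistics.fmean([f for f, _ in chunk]),
                       statistics.fmean([y for _, y in chunk])))
    return points


rng = random.Random(4)
x = [rng.gauss(0, 0.7) for _ in range(20000)]
y = [xi + rng.gauss(0, 0.7) for xi in x]   # E(Y | X) = X, so X is reliable
shrunk = [0.5 * xi for xi in x]            # an under-confident version of X

reliable_pts = reliability_points(x, y)
shrunk_pts = reliability_points(shrunk, y)

for (f, o), (fs, os_) in zip(reliable_pts, shrunk_pts):
    print(f"reliable: ({f:+.2f}, {o:+.2f})   shrunk: ({fs:+.2f}, {os_:+.2f})")
```

For the reliable forecast the points fall close to the diagonal, while for the shrunk forecast the average outcomes in the outer bins are more extreme than the average forecasts, which is the signature of under-confidence.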

Figures 2d and 3d present the reliability diagrams for $X''$ under no and high information
overlap, respectively. Comparing these plots to the corresponding reliability diagrams of $\bar{X}$
and $X_w$ in the same figures reveals that $\bar{X}$ and $X_w$ are not only unreliable but also have
smaller variance than $X''$. Furthermore, the manner in which the plotted points deviate from the
diagonal suggests that $\bar{X}$ and $X_w$ are under-confident in both information scenarios. The level
of under-confidence is particularly startling in Figures 2a and 2b but decreases as information
overlap is introduced in Figures 3a and 3b. Given that averaging-like techniques do not behave
like information aggregators, that is, they are not expanding, it is not surprising to see them
perform better under high information overlap when aggregating information is less important
for good performance. Table 1 shows the parameter estimates for $X_w$ and $X^*$. The weights
in $X_w$ increase in the forecaster's amount of information and differ noticeably from the equal
weights employed by $\bar{X}$. More importantly, however, in both information scenarios $\alpha > 1$.
This reflects the need to correct the under-confidence of $X_w$. The resulting $X^*$ is more reliable
and confident, as can be seen in Figures 2c and 3c. Furthermore, it behaves very similarly to
the optimal aggregator $X''$ under both information structures.

Figure 2: Synthetic Data. Out-of-sample reliability under no information overlap.
Panels: (a) $\bar{X}$, (b) $X_w$, (c) $X^*$, (d) $X''$.

Figure 3: Synthetic Data. Out-of-sample reliability under high information overlap.
Panels: (a) $\bar{X}$, (b) $X_w$, (c) $X^*$, (d) $X''$.
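
The under-confidence of averaging without information overlap can be reproduced in a toy simulation. The sketch below does not use the synthetic model of this section; it assumes a hypothetical setup in which $Y = Z_1 + Z_2 + Z_3$ with independent standard Gaussian signals and two forecasters reporting $X_1 = Z_1$ and $X_2 = Z_2$. In that setup the least-squares slope of $Y$ on the equally weighted average should be close to 2, that is, the average has to be moved away from the mean $\mu_0 = 0$ by roughly that factor.

```python
import random


def slope_of_outcome_on_average(n_draws=20000, seed=0):
    """Simulate the toy model and regress the outcome on the average forecast.

    Returns (slope, variance of the average forecast).
    """
    rng = random.Random(seed)
    averages, outcomes = [], []
    for _ in range(n_draws):
        z1, z2, z3 = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        outcomes.append(z1 + z2 + z3)     # Y depends on three independent signals
        averages.append(0.5 * (z1 + z2))  # X_1 = Z_1 and X_2 = Z_2, equally weighted
    mean_a = sum(averages) / n_draws
    mean_y = sum(outcomes) / n_draws
    var_a = sum((a - mean_a) ** 2 for a in averages) / n_draws
    cov_ay = sum((a - mean_a) * (y - mean_y)
                 for a, y in zip(averages, outcomes)) / n_draws
    return cov_ay / var_a, var_a


if __name__ == "__main__":
    slope, var_avg = slope_of_outcome_on_average()
    print(f"slope of Y on the average forecast: {slope:.3f}")
    print(f"variance of the average forecast:   {var_avg:.3f}")
```

The variance of the average (about 0.5) is below that of either individual forecast (1), which is the non-expanding behaviour discussed above.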

In addition to performing visual assessment, the aggregators can be compared based on
their out-of-sample average quadratic loss. To make this specific, let $Y = (Y_1, \dots, Y_K)$ collect
all the outcomes of the testing problems and $X = (X_1, \dots, X_K)$ be a vector of some aggregate
forecasts for the same problems. Then, the average quadratic loss for this aggregator is

\[
L(Y, X) = \frac{1}{K} \sum_{k=1}^{K} (Y_k - X_k)^2 .
\]


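If the forecasts are probability estimates of binary outcomes, the above loss is known to have a
decomposition that permits a closer analysis of reliability and resolution (Brier 1950; Murphy
1973). The decomposition, however, is not limited to probability forecasts. To see this, suppose
that the real-valued aggregate $X_k \in \{f_1, \dots, f_I\}$ for some finite number $I$. Let $K_i$ be the
number of times $f_i$ occurs, $\bar{Y}_i$ be the empirical average of $\{Y_k : X_k = f_i\}$, and
$\bar{Y} = \frac{1}{K} \sum_{k=1}^{K} Y_k$. Then,

\[
L(Y, X) =
\underbrace{\frac{1}{K} \sum_{i=1}^{I} K_i (f_i - \bar{Y}_i)^2}_{\text{REL}}
- \underbrace{\frac{1}{K} \sum_{i=1}^{I} K_i (\bar{Y}_i - \bar{Y})^2}_{\text{RES}}
+ \underbrace{\frac{1}{K} \sum_{k=1}^{K} (Y_k - \bar{Y})^2}_{\text{UNC}} .
\tag{4}
\]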

See the Appendix for the derivation of this decomposition. The three components of the
decomposition are highly interpretable. In particular, low REL suggests high reliability. If the
aggregate is reliable, then RES is approximately equal to the sample variance of the aggregate
and is increasing in resolution. The final term, UNC, does not depend on the forecasts. This
is the sample variance of $Y$ and therefore gives an approximate upper bound on the variance
of any reliable forecast. As has been mentioned before, the goal is to maximize resolution
subject to reliability. This decomposition shows how the quadratic loss addresses reliability
and resolution simultaneously and therefore provides a convenient loss function for learning
aggregation parameters.

Table 2: Synthetic Data. The average quadratic loss, $L(Y, X)$, with its three additive components:
reliability (REL), resolution (RES), and uncertainty (UNC). The final column gives
the estimated variance of the forecast.

Scenario       Forecast          L(Y,X)   REL      RES      UNC      Est. variance
No Overlap     Best Individual   0.8024   0.0050   0.2108   1.0081   0.200
               Median            0.7322   0.2928   0.5688   1.0081   0.046
               X̄                 0.7185   0.5140   0.8036   1.0081   0.032
               X_w               0.7016   0.2913   0.5979   1.0081   0.055
               X*                0.1971   0.0022   0.8132   1.0081   0.799
               X''               0.1969   0.0021   0.8132   1.0081   0.807
High Overlap   Best Individual   0.8141   0.0061   0.2195   1.0275   0.199
               Median            0.8492   0.0087   0.1870   1.0275   0.125
               X̄                 0.8254   0.0137   0.2157   1.0275   0.128
               X_w               0.7889   0.0166   0.2552   1.0275   0.150
               X*                0.7758   0.0056   0.2573   1.0275   0.228
               X''               0.6837   0.0057   0.3496   1.0275   0.318
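
The decomposition is easy to check numerically. The sketch below is a minimal illustration, assuming the aggregate takes finitely many distinct values (as in the derivation); continuous forecasts would first have to be discretized, for example into bins, and that step is not shown. The function name and the toy numbers are hypothetical.

```python
from collections import defaultdict


def quadratic_loss_decomposition(forecasts, outcomes):
    """Return (loss, REL, RES, UNC) with loss = REL - RES + UNC.

    Outcomes are grouped by the distinct forecast values f_i; K_i is the
    group size and the group mean plays the role of the conditional average.
    """
    K = len(outcomes)
    y_bar = sum(outcomes) / K
    groups = defaultdict(list)
    for f, y in zip(forecasts, outcomes):
        groups[f].append(y)
    rel = sum(len(ys) * (f - sum(ys) / len(ys)) ** 2 for f, ys in groups.items()) / K
    res = sum(len(ys) * (sum(ys) / len(ys) - y_bar) ** 2 for ys in groups.values()) / K
    unc = sum((y - y_bar) ** 2 for y in outcomes) / K
    loss = sum((y - f) ** 2 for f, y in zip(forecasts, outcomes)) / K
    return loss, rel, res, unc


if __name__ == "__main__":
    # Toy example: each forecast value equals the mean outcome of its group,
    # so the aggregate is reliable and REL is zero.
    forecasts = [1.0, 1.0, 2.0, 2.0, 2.0]
    outcomes = [0.5, 1.5, 1.0, 2.0, 3.0]
    loss, rel, res, unc = quadratic_loss_decomposition(forecasts, outcomes)
    print(loss, rel, res, unc)
```

Because the identity is exact, it also serves as a sanity check on the tabulated values: each row should satisfy loss = REL - RES + UNC up to rounding.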

Table 2 presents the quadratic loss, its additive components, and the estimated variance
for each of the different forecasts under both information scenarios. In addition to the
aforementioned $\bar{X}$, $X_w$, $X^*$, and $X''$, the table also presents scores for the median forecast and
the individual forecaster with the lowest quadratic loss. Even though the best individual is
reliable by construction, it is highly unresolute and hence gains an overall poor quadratic loss.
Under high information overlap, however, this individual is better than both the median and $\bar{X}$
because these aggregators assign too much importance to the individual forecasters with very
little information. As predicted by Theorem 2.2, the median and the averaging aggregators $\bar{X}$
and $X_w$ are neither reliable nor expanding. The remaining two aggregators, namely $X^*$ and $X''$,
on the other hand, are reliable and expanding. Table 2 shows that $X^*$ is in fact almost equivalent
to $X''$ under no information overlap. Under high information overlap, however, $X''$ gains a slight
advantage over $X^*$. In this case $X^*$ cannot take the same form as $X''$. Consequently, it has
an estimated variance of 0.228, which is well below the amount of information known to the
group, namely 0.320. It fails to use information optimally because it cannot subtract off the
shared information $X_1$ and hence avoid double-counting of information. However, despite
using information less efficiently, it is as reliable as $X''$.

Of course, under the Gaussian model, $X^*$ may seem redundant because the optimal $X''$ can
be computed directly. In practice, however, $\Sigma$ is not known and must be estimated under a
non-trivial semidefinite constraint (see Satopää et al. 2015a for more details). Given that this
involves a total of $\binom{N}{2} + N$ parameters, the estimation task is challenging even for moderately
large $N$, say, greater than 100. Furthermore, accurately estimating such a large number of
parameters requires the forecasters to attend a large number of prediction problems. Applying
$X^*$ instead is significantly easier because it involves only $N + 1$ parameters that can be
estimated via a standard quadratic program. Therefore this aggregator scales better to large
groups of forecasters. On the other hand, the quadratic program requires a training set with known
outcomes whereas $\Sigma$ can be learned from the forecasts alone. Therefore the two aggregators serve
somewhat different purposes and should be considered complementary rather than competitive.
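
To make the difference in scale concrete, the two parameter counts quoted above can be tabulated for a few group sizes. This is only arithmetic on the stated formulas; the function names are introduced here for illustration.

```python
from math import comb


def covariance_parameter_count(n_forecasters):
    """Free entries of a symmetric N x N covariance matrix: C(N, 2) + N."""
    return comb(n_forecasters, 2) + n_forecasters


def extremizer_parameter_count(n_forecasters):
    """Parameter count quoted in the text for the extremized average: N + 1."""
    return n_forecasters + 1


if __name__ == "__main__":
    for n in (5, 20, 100):
        print(n, covariance_parameter_count(n), extremizer_parameter_count(n))
```

For 100 forecasters the covariance-based approach involves 5,050 parameters against 101 for the extremized weighted average.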


5. CASE STUDY: CONCRETE COMPRESSIVE STRENGTH 


Concrete is the most important material in civil engineering. One of its key properties is
compressive strength, which depends on the water-to-cement ratio but also on several other
ingredients. Yeh (1998) illustrated this by statistically predicting compressive strength based on
age and seven mixture ingredients. The associated dataset is freely available at the UC Irvine
Machine Learning Repository (Lichman 2013) and consists of 1,030 observations with the
following information:

    Variable                                         Used by models
    Y   : Compressive Strength
    v_1 : Cement (kg in a m^3 mixture)               M_1         M_F
    v_2 : Coarse Aggregate (kg in a m^3 mixture)     M_1         M_F
    v_3 : Fly Ash (kg in a m^3 mixture)              M_1, M_3    M_F
    v_4 : Water (kg in a m^3 mixture)                M_1, M_3    M_F
    v_5 : Superplasticizer (kg in a m^3 mixture)     M_2, M_3    M_F
    v_6 : Fine Aggregate (kg in a m^3 mixture)       M_2, M_3    M_F
    v_7 : Blast Furnace Slag (kg in a m^3 mixture)   M_2         M_F
    v_8 : Age (days)                                 M_2         M_F        (5)

This particular dataset is appropriate for illustrating our results because it is simple yet large 
enough to allow the computation of reliability diagrams and the individual components of the 
average quadratic loss. 

The individual forecasters are emulated with three linear regression models, $M_1$, $M_2$, and
$M_3$, that predict $Y$ based on different sets of predictors. In particular, model $M_1$ only uses
predictors $v_1, v_2, v_3, v_4$, whereas model $M_2$ uses the remaining predictors $v_5, v_6, v_7, v_8$. Therefore
their predictor sets are non-overlapping. The third model $M_3$ uses the middle four predictors
$v_3, v_4, v_5, v_6$, and hence has significant overlap with the other two models. The results are
compared against a linear regression model $M_F$ that has access to all eight predictors. This is not
an aggregator and only represents the extent to which the predictors can explain the outcome
$Y$. Therefore it provides interpretation and scale. The predictor sets corresponding to the
different models are summarized by the curly braces in (5). Overall, this setup can be viewed
as a real-valued equivalent of the case study in Ranjan and Gneiting (2010), who aggregate
probability forecasts from three different logistic regression models.

The evaluation is based on a 10-fold cross validation. The models $M_1$, $M_2$, and $M_3$ are
first trained on one half of the training set and then used to make predictions for the second
half and the entire testing set. Next, the aggregators are trained on the models' predictions over
the second half of the training set. Finally, the trained aggregators are tested on the models'
predictions over the testing set. Therefore all the following results, apart from the parameter
estimates, represent out-of-sample performance. Similarly to Section 4, the evaluation is
performed separately under two different information structures: the No Information Overlap
scenario considers only predictions from models $M_1$ and $M_2$, whereas the High Information
Overlap scenario involves only predictions from models $M_1$ and $M_3$.

Figure 4: Real-World Data. Out-of-sample reliability of the individual models.
Panels: (a) $M_1$, (b) $M_2$, (c) $M_3$, (d) $M_F$.

Figure 5: Real-World Data. Out-of-sample reliability of aggregators under no information
overlap.
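
The index bookkeeping of this scheme can be sketched as follows. The text does not say how the folds or the two halves of the training set were drawn, so the random shuffle, the fixed seed, and the split of the training indices by position are assumptions of this example, as is the function name `nested_cv_splits`.

```python
import random


def nested_cv_splits(n_obs, n_folds=10, seed=0):
    """Yield (model_train, aggregator_train, test) index lists for each fold.

    model_train:      first half of the training set, used to fit the models
    aggregator_train: second half, whose model predictions train the aggregators
    test:             held-out fold on which the trained aggregators are scored
    """
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)
    folds = [indices[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        test = folds[f]
        train = [i for g in range(n_folds) if g != f for i in folds[g]]
        half = len(train) // 2
        yield train[:half], train[half:], test


if __name__ == "__main__":
    for model_train, agg_train, test in nested_cv_splits(1030):
        print(len(model_train), len(agg_train), len(test))
```

Each observation lands in exactly one test fold, and within a fold no observation is used both to fit the models and to train the aggregators.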

Figures 4, 5, and 6 present the reliability diagrams of the individual models and the
aggregators under no and high information overlap, respectively. Unlike in Section 4, the marginal
distribution of $Y$ is not known. Therefore the red curve over the inset histogram represents
the empirical distribution of $Y$. Similarly, the dashed vertical line represents the sample
average of the outcomes instead of the marginal mean $\mu_0$. According to these plots, the individual
forecasts are mostly reliable, except at extremely small or large forecasts. The averaging
aggregators $\bar{X}$ and $X_w$, on the other hand, are both unreliable and under-confident. Similarly to
Section 4 and in accordance with Theorem 2.2, this under-confidence decreases as the
forecasters' information overlap increases from Figure 5 to Figure 6. Table 3 gives the parameter
estimates for $X_w$ and $X^*$. These aggregators employ very similar weights. In both information
scenarios $\alpha > 1$, suggesting that $X_w$ is under-confident and should be extremized.
Based on Figures 5 and 6, the resulting aggregator $X^*$ is noticeably more reliable and
appears to approximate the empirical distribution of $Y$ quite closely. Simply based on visual
assessment, $X^*$ performs as well as $M_F$ under low information overlap but loses some resolution
once overlap is introduced. This makes sense because the models considered in the high
information overlap scenario, namely $M_1$ and $M_3$, have access only to the first six predictors
while $M_F$ uses all eight predictors and hence should have a higher level of information.

Figure 6: Real-World Data. Out-of-sample reliability of aggregators under high information
overlap.

Table 3: Real-World Data. Estimated parameter values.

Scenario       Forecast   Ao          α        w1       w2
No Overlap     X_w                             0.5327   0.4673
               X*         -36.2051    1.6950   0.5269   0.4731
High Overlap   X_w                             0.5931   0.4069
               X*         -37.6776    1.4382   0.5375   0.4625


Table 4: Real-World Data. The average quadratic loss, $L(Y, X)$, with its three additive components:
reliability (REL), resolution (RES), and uncertainty (UNC). The final column gives
the estimated variance of the forecast.

Scenario       Forecast   L(Y,X)   REL     RES      UNC      Est. variance
               M_1        187.80    9.70   100.72   278.81    82.83
               M_2        185.74   12.01   105.08   278.81    92.51
               M_3        197.03   12.81    94.59   278.81    73.27
               M_F        110.91    9.46   177.36   278.81   157.87
No Overlap     X̄          155.69   30.99   154.10   278.81    56.33
               X_w        156.32   31.45   153.94   278.81    56.21
               X*         133.23    9.86   155.45   278.81   161.89
High Overlap   X̄          177.45   16.77   118.13   278.81    61.92
               X_w        176.59   14.37   116.59   278.81    63.32
               X*         169.92    8.20   117.09   278.81   128.69



Table 4 provides a numerical comparison by presenting the average quadratic loss, its
additive components, and the estimated variance for the individual models and the competing
aggregators. Given that all aggregators perform better than the individual forecasters,
aggregation is generally beneficial. However, there are large performance differences among the
aggregators. In particular, the variances of $\bar{X}$ and $X_w$ do not exceed those of the individual
forecasters, suggesting that neither of them is expanding. Furthermore, they are much less
reliable than the individual forecasters. In contrast, $X^*$ is able to maintain the forecasters'
level of reliability. Even though this aggregator is expanding, it is less resolute and has a lower
variance than $M_F$ under high information overlap. This can be expected because in the high
information overlap scenario $X^*$ has access only to a subset of the information that $M_F$ uses.
Under no information overlap, all the predictors are used by the individual forecasters, but this
does not mean that this information is actually revealed to $X^*$ through the reported forecasts.


6. SUMMARY AND DISCUSSION 


This paper discussed forecast aggregation under a general probability model, called the partial
information framework. The forecasts and outcomes were assumed to have a joint distribution
but no restrictions were placed on their dependence structure. The analysis led to an
enumeration (Theorem 2.1) of several properties of optimal aggregation. Even though the optimal
aggregator is typically intractable in practice, its properties provide guidance for developing
and understanding other aggregators that are more feasible in practice. In this paper these
properties shed light on the class of weighted averages of any type of univariate forecasts.
Even though these averages are marginally consistent, they fail to satisfy two of the optimality
properties, namely reliability and variance expansion (Theorem 2.2). As a result, they are
under-confident in the sense that they are overly close to the marginal mean. This shortcoming
can be naturally alleviated by extremizing, that is, by shifting the weighted average further
away from the marginal mean. Section 3 introduced a simple linear procedure (Equation 1)
that extremizes the weighted average of real-valued forecasts and maintains marginal
consistency. This procedure and the theoretical results were illustrated on synthetic (Section 4) and
real-world data (Section 5). In both cases the optimally weighted average was shown to be both
unreliable and under-confident, especially when the forecasters used very different sets of
information. Fortunately, extremization was able to largely correct these drawbacks and provide
transformed aggregates that were both reliable and more resolute.

Forecast aggregation literature by and large agrees that the goal is to collect and combine
information from different forecasters (see, e.g., Dawid et al. 1995; Armstrong 2001; Forlines
et al. 2012). At the same time aggregation continues to be performed via weighted averaging
or perhaps some other measure of central tendency, such as the median (Levins 1966;
Armstrong 2001; Lobo and Yao 2010). Section 2.2 explained that these popular techniques do
not behave like aggregators of information. Instead, they are designed to reduce measurement
error, which is philosophically very different from information diversity (Satopää et al. 2015a).
Therefore some details of their workings seem to have been misunderstood. Unfortunately, it is
unlikely that this paper will prevent aggregation with measures of central tendency altogether.
However, it is hoped that our contributions will at least prompt interest and provide direction
in discovering alternative aggregation techniques.

This paper illustrated that good information aggregation can arise from a simple linear
transformation that extremizes the weighted average. Of course, under a large number of
prediction problems, a non-linear extremizing function can lead to further improvements in
aggregation. The linear function, however, is a simple and natural starting point that suffices for
illustrating the benefits of extremizing. Is extremizing then guaranteed to be beneficial in every
prediction task? Probably not. Therefore, for the sake of applications, it is important to discuss
conditions under which extremizing is likely to improve the commonly used aggregators. Item
iv) of Theorem 2.2 and the empirical results in Sections 4 and 5 suggest that extremizing is
likely to be more beneficial under no or low information overlap. This aligns with Satopää
et al. (2015b), who use the Gaussian partial information model to show empirically that
extremizing probability forecasts becomes more important a) as the amount of the forecasters'
combined information increases, and b) as the forecasters' information sets become more
diverse. This means that, for instance, the average forecast of team members working in close
collaboration requires little extremizing whereas forecasts coming from widely different sources
must be heavily extremized.

Unfortunately, the amount and direction of extremization depend on a training set with
known outcomes. Such a training set may not always be available. In the most extreme case
the decision-maker may have only a set of forecasts of a single unknown outcome. How should
the forecasts be aggregated in such a low-data setting? The results in this paper suggest that
any type of weighted average (or some other measure of central tendency) is a poor choice. A
better alternative was discussed by Satopää et al. (2015b). They assume that the forecasters'
covariance matrix is compound symmetric and then aggregate the probability forecasts with the
optimal aggregator under the corresponding Gaussian partial information model. Developing
more general aggregators that place fewer constraints on the joint dependence structure while
satisfying at least two of the optimality properties of Theorem 2.1 is certainly an interesting
future research direction. The first step is to develop a simple aggregator that is both marginally
consistent and expanding. Finding an aggregator that maintains forecasters' reliability seems
more difficult.


A. APPENDIX 


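A.1 Proof of Theorem 2.1

i) The law of total expectation gives:

\[
E(X'') = E[E(Y \mid \mathcal{F}'')] = E(Y) = \mu_0 .
\]

ii) Recall that $X'' = E(Y \mid \mathcal{F}'')$, $X'' \in \mathcal{F}''$, and $\mathcal{F}'' = \sigma(X_1, \dots, X_N)$. Then,

\begin{align*}
E(Y \mid X'')
&= E[E(Y \mid X'', \mathcal{F}'') \mid X''] && \text{(as $X'' \in \mathcal{F}''$)} \\
&= E[E(Y \mid \mathcal{F}'') \mid X''] \\
&= E(X'' \mid X'') \\
&= X'' .
\end{align*}

iii) This relies on the observation that $\sigma(X_m) \subseteq \mathcal{F}'' = \sigma(X_1, \dots, X_N)$. Then,

\begin{align*}
\delta_{\max} &= \mathrm{Var}(X_m) \\
&= E(X_m^2) - \mu_0^2 \\
&= E[E(Y \mid X_m) X_m] - \mu_0^2 && \text{(as $X_m = E(Y \mid X_m)$)} \\
&= E\{E[E(Y \mid \mathcal{F}'') \mid X_m] X_m\} - \mu_0^2 && \text{(the smallest $\sigma$-field wins)} \\
&= E[E(X'' \mid X_m) X_m] - \mu_0^2 \\
&= E[E(X'' X_m \mid X_m)] - \mu_0^2 \\
&= E(X'' X_m) - \mu_0^2 && \text{(reverse iterated expectation)} \\
&= E[(X'' - \mu_0)(X_m - \mu_0)] \\
&\leq \sqrt{\mathrm{Var}(X'') \, \delta_{\max}} && \text{(by the Cauchy-Schwarz inequality).}
\end{align*}

Squaring and dividing both sides by $\delta_{\max}$ gives the desired result.

□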


A.2 Proof of Theorem 2.2

Items ii) and iii) are generalizations of the proof in Ranjan and Gneiting (2010).

i) This follows from direct computation:

\[
E(X_w) = E(\mathbf{w}'\mathbf{X}) = \mathbf{w}' E(\mathbf{X}) = \mu_0 \mathbf{w}' \mathbf{1}_N = \mu_0 .
\]


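ii) Consider some reliable aggregate $X$ such that $E(Y \mid X) = X$. Then,

\begin{align*}
E[(Y - X)^2]
&= E\{E[(Y - X)^2 \mid X]\} \\
&= E[E(Y^2 - 2YX + X^2 \mid X)] \\
&= E[E(Y^2 \mid X) - X^2] \\
&= E(Y^2) - E(X^2) .
\end{align*}

The rest of the proof shows that if $X = X_w = \mathbf{w}'\mathbf{X}$, then the above identity cannot
hold. This gives a contradiction and hence proves the desired result. First, note that
$\sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j = 1$. Then,

\begin{align*}
E[(Y - X_w)^2]
&= E[(Y - \mathbf{w}'\mathbf{X})^2] \\
&= E\left\{ \left[ \sum_{i=1}^{N} w_i (Y - X_i) \right]^2 \right\} \\
&= E\left[ \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j (Y - X_i)(Y - X_j) \right] \\
&= \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j E\left[ Y^2 - Y X_i - Y X_j + X_j X_i \right] \\
&= \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j E\left[ E(Y^2 \mid X_i) - E(Y X_i \mid X_i) - E(Y X_j \mid X_j) + X_j X_i \right] \\
&= \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j E\left[ E(Y^2 \mid X_i) - X_i^2 - X_j^2 + X_j X_i \right] \\
&= \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j E\left[ E(Y^2 \mid X_i) + (X_i X_j - X_i X_j) - X_i^2 - X_j^2 + X_j X_i \right] \\
&= \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j E\left[ E(Y^2 \mid X_i) - X_i X_j - (X_i - X_j)^2 \right] \\
&= \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j E\left[ E(Y^2 \mid X_i) \right]
 - \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j E\left[ X_i X_j \right]
 - \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j E\left[ (X_i - X_j)^2 \right] \\
&= E(Y^2) - \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j E(X_i X_j)
 - \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j E\left[ (X_i - X_j)^2 \right] \\
&= E(Y^2) - E(\mathbf{w}'\mathbf{X}\mathbf{X}'\mathbf{w})
 - \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j E\left[ (X_i - X_j)^2 \right] \\
&= \left[ E(Y^2) - E(X_w^2) \right]
 - \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j E\left[ (X_i - X_j)^2 \right] .
\end{align*}

This leads to a contradiction because the double sum on the final line is strictly positive as
long as there exists a forecast pair $i \neq j$ such that $P(X_i \neq X_j) > 0$ and $w_i, w_j > 0$.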


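iii) Write $X_w^c = E(Y \mid X_w)$ for the calibrated version of $X_w$. The fact that $E(X_w^c) = \mu_0$
follows similarly to the proof of item i) of Theorem 2.1. This
item continues under the conditions of the previous item. Therefore it can be assumed that
$X_w$ is not calibrated, that is, $P(X_w^c \neq X_w) > 0$. Then,

\begin{align*}
E[(Y - X_w)^2]
&= E\left( Y^2 - 2 Y X_w + X_w^2 \right) \\
&= E\left( Y^2 + 2 \left[ (X_w^c)^2 - (X_w^c)^2 \right] - 2 Y X_w + X_w^2 \right) \\
&= E\left( Y^2 - 2 Y X_w^c + 2 (X_w^c)^2 - 2 X_w^c X_w + X_w^2 \right) \\
&= E\left[ (Y - X_w^c)^2 \right] + E\left[ (X_w^c - X_w)^2 \right] \\
&= E(Y^2) - E\left[ (X_w^c)^2 \right] + E\left[ (X_w^c - X_w)^2 \right] && \text{(because $X_w^c$ is reliable)} \\
&> E(Y^2) - E\left[ (X_w^c)^2 \right] .
\end{align*}

Furthermore, from the previous item, $E[(Y - X_w)^2] < E(Y^2) - E(X_w^2)$. Putting this all
together gives

\begin{align*}
& E(Y^2) - E\left[ (X_w^c)^2 \right] < E(Y^2) - E(X_w^2) \\
\Leftrightarrow \quad & E\left[ (X_w^c)^2 \right] - \mu_0^2 > E(X_w^2) - \mu_0^2 \\
\Leftrightarrow \quad & \mathrm{Var}(X_w^c) > \mathrm{Var}(X_w) .
\end{align*}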


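iv) The fact that $\mathrm{Var}(X_w) \leq \delta_{\max}$ follows from direct computation:

\begin{align*}
\mathrm{Var}(X_w)
&= E[(\mu_0 - X_w)^2] \\
&= E(X_w^2) - \mu_0^2 \\
&= \mathbf{w}' E(\mathbf{X}\mathbf{X}') \mathbf{w} - \mathbf{w}' \mathbf{1}_N \mu_0^2 \mathbf{1}_N' \mathbf{w} \\
&= \mathbf{w}' \left[ E(\mathbf{X}\mathbf{X}') - \mu_0^2 \mathbf{1}_N \mathbf{1}_N' \right] \mathbf{w} \\
&= \mathbf{w}' E\left[ (\mathbf{X} - \mathbf{1}_N \mu_0)(\mathbf{X} - \mathbf{1}_N \mu_0)' \right] \mathbf{w} \\
&= \mathbf{w}' \mathrm{Cov}(\mathbf{X}) \mathbf{w} \\
&\leq \delta_{\max} \mathbf{w}' \mathbf{1}_N \mathbf{1}_N' \mathbf{w} \\
&= \delta_{\max} .
\end{align*}

To see the identity part of the statement, note that

\[
\mathrm{Var}(X_w) = \mathbf{w}' \mathrm{Cov}(\mathbf{X}) \mathbf{w} = \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij} \mathrm{Cov}(X_i, X_j),
\]

where $w_{ij} = w_i w_j \in [0, 1]$ and $\sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij} = 1$. First, suppose that
$\mathrm{Var}(X_m) = \delta_{\max} > \mathrm{Var}(X_i) = \delta_i$ for all $i \neq m$. Then, if $w_{ii} > 0$ for some $i \neq m$, the term
$w_{ii} \mathrm{Cov}(X_i, X_i)$ brings $\mathrm{Var}(X_w)$ below $\delta_{\max}$. This decrease cannot be compensated by
any other term because no element in $\mathrm{Cov}(\mathbf{X})$ is larger than $\delta_{\max}$. Consequently, it must
be the case that $w_i = 0$ for all $i \neq m$. Now, if there exists $j \neq m$ such that $\delta_j = \delta_{\max}$ and
$w_j > 0$, then $\mathrm{Var}(X_w) = \delta_{\max}$ only if all weight is given to $X_m$ and $X_j$, and
$\mathrm{Cov}(X_m, X_j) = \delta_{\max}$. This covariance implies that $\mathrm{Corr}(X_j, X_m) = 1$. Thus,
$\sigma(X_j) = \sigma(X_m)$ and hence $X_j = E[Y \mid \sigma(X_j)] = E[Y \mid \sigma(X_m)] = X_m$. Consequently,
$\mathrm{Var}(X_w) = \delta_{\max}$ only if all weight is distributed among the $X_j$ such that $X_j = X_m$.

From Theorem 2.1, $\delta_{\max} \leq \mathrm{Var}(X'')$, where the inequality arises from the Cauchy-Schwarz
inequality. It is well-known that this reduces to an equality if and only if $X''$ and
$X_m$ are linearly dependent. Such a linear dependence would imply that $\sigma(X'') = \sigma(X_m)$
and hence that $X_m = E[Y \mid \sigma(X_m)] = E[Y \mid \sigma(X'')] = X''$. Now, if there exists $j \neq m$ such
that $\delta_j = \delta_{\max}$, then by the same argument $\sigma(X'') = \sigma(X_m) = \sigma(X_j)$ and consequently
$X_j = X_m = X''$.

Putting this all together gives that $\mathbf{w}'\mathbf{X} = X''$ if and only if $\sigma(X_m) = \sigma(X'')$ and $w_i > 0$
only for those $i$ with $X_i = X_m$.

□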


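A.3 Derivation of Equation (4)

Suppose that $X_k \in \{f_1, \dots, f_I\}$ for some finite $I$. Let $K_i$ be the number of times $f_i$ occurs, $\bar{Y}_i$
be the empirical average of $\{Y_k : X_k = f_i\}$, and $\bar{Y} = \frac{1}{K} \sum_{k=1}^{K} Y_k$. Then,

\begin{align*}
L(Y, X)
&= \frac{1}{K} \sum_{k=1}^{K} (X_k - Y_k)^2 \\
&= \frac{1}{K} \left[ \sum_{k=1}^{K} X_k^2 - 2 \sum_{k=1}^{K} X_k Y_k + \sum_{k=1}^{K} Y_k^2 \right] \\
&= \frac{1}{K} \left[ \sum_{i=1}^{I} K_i f_i^2 - 2 \sum_{i=1}^{I} K_i f_i \bar{Y}_i + \sum_{k=1}^{K} Y_k^2 \right] \\
&= \frac{1}{K} \left[ \sum_{i=1}^{I} K_i \left( f_i^2 - 2 f_i \bar{Y}_i + 2 \bar{Y}_i \bar{Y} - \bar{Y}^2 \right)
 + \sum_{k=1}^{K} \left( Y_k^2 - 2 Y_k \bar{Y} + \bar{Y}^2 \right) \right] \\
&= \frac{1}{K} \left[ \sum_{i=1}^{I} K_i \left( f_i^2 - 2 f_i \bar{Y}_i + \bar{Y}_i^2 - \bar{Y}_i^2 + 2 \bar{Y}_i \bar{Y} - \bar{Y}^2 \right)
 + \sum_{k=1}^{K} (Y_k - \bar{Y})^2 \right] \\
&= \frac{1}{K} \left[ \sum_{i=1}^{I} K_i \left( f_i^2 - 2 f_i \bar{Y}_i + \bar{Y}_i^2 \right)
 - \sum_{i=1}^{I} K_i \left( \bar{Y}_i^2 - 2 \bar{Y}_i \bar{Y} + \bar{Y}^2 \right)
 + \sum_{k=1}^{K} (Y_k - \bar{Y})^2 \right] \\
&= \frac{1}{K} \sum_{i=1}^{I} K_i (f_i - \bar{Y}_i)^2
 - \frac{1}{K} \sum_{i=1}^{I} K_i (\bar{Y}_i - \bar{Y})^2
 + \frac{1}{K} \sum_{k=1}^{K} (Y_k - \bar{Y})^2 .
\end{align*}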